-
Notifications
You must be signed in to change notification settings - Fork 3
Fix parallel execution for file based executor #760
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe changes extend the Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant SharedPy
participant CommandPy
Caller->>SharedPy: execute_tasks_h5(file_name, cores, backend, pmi_mode)
SharedPy->>CommandPy: get_cache_execute_command(file_name, cores, backend, pmi_mode)
CommandPy-->>SharedPy: command_list
SharedPy-->>Caller: command_list
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
executorlib/standalone/command.py (1)
36-55: Consider refactoring for better maintainability.The backend logic is functionally correct and handles all cases properly. However, the implementation is repetitive and could be improved for maintainability.
Consider refactoring to reduce code duplication:
- if backend is None: - command_lst = ( - ["mpiexec", "-n", str(cores)] - + command_lst - + [get_command_path(executable="cache_parallel.py"), file_name] - ) - elif backend == "slurm": - command_lst = ( - ["srun", "-n", str(cores)] - + command_lst - + [get_command_path(executable="cache_parallel.py"), file_name] - ) - elif backend == "flux": - command_lst = ( - ["flux", "run", "-n", str(cores)] - + command_lst - + [get_command_path(executable="cache_parallel.py"), file_name] - ) - else: - raise ValueError(f"backend should be None, slurm or flux, not {backend}") + backend_commands = { + None: ["mpiexec", "-n", str(cores)], + "slurm": ["srun", "-n", str(cores)], + "flux": ["flux", "run", "-n", str(cores)], + } + + if backend not in backend_commands: + raise ValueError(f"backend should be None, slurm or flux, not {backend}") + + command_lst = ( + backend_commands[backend] + + command_lst + + [get_command_path(executable="cache_parallel.py"), file_name] + )This approach centralizes the backend configuration and makes it easier to add new backends in the future.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
executorlib/standalone/command.py(2 hunks)executorlib/task_scheduler/file/shared.py(1 hunks)tests/test_standalone_command.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: unittest_slurm_mpich
- GitHub Check: notebooks_integration
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.11)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.12)
- GitHub Check: unittest_openmpi (macos-latest, 3.13)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.11)
- GitHub Check: unittest_flux_mpich
- GitHub Check: unittest_flux_openmpi
- GitHub Check: unittest_mpich (ubuntu-latest, 3.12)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.13)
- GitHub Check: unittest_mpich (macos-latest, 3.13)
- GitHub Check: unittest_win
🔇 Additional comments (7)
executorlib/standalone/command.py (2)
4-4: LGTM: Proper type import for optional parameter.The
Optionalimport is correctly added to support the new optionalbackendparameter type hint.
20-22: LGTM: Well-designed function signature enhancement.The function signature properly adds the optional
backendparameter with appropriate type hints and maintains backward compatibility. The docstring clearly documents the new parameter and its accepted values.Also applies to: 29-29
executorlib/task_scheduler/file/shared.py (1)
157-157: LGTM: Proper parameter propagation.The
backendparameter is correctly passed through toget_cache_execute_command, enabling the backend-specific command generation functionality introduced in the command module.tests/test_standalone_command.py (4)
1-6: LGTM: Clean test structure and imports.The test imports are appropriate and the test class follows standard unittest conventions.
7-15: LGTM: Good coverage of interactive command generation.The tests properly verify that the correct scripts are selected for serial vs parallel execution. Using path splitting to check script names is a good approach for cross-platform compatibility.
17-30: LGTM: Comprehensive serial execution testing.The tests correctly verify that backend parameters have no effect on serial execution (cores=1), testing multiple backend values to ensure consistent behavior. This validates that the backend logic only applies to parallel execution.
32-57: Excellent comprehensive testing of the new backend functionality.The parallel execution tests thoroughly validate all supported backends with complete command structure verification. The tests properly check:
- Correct backend commands (mpiexec, srun, flux run)
- Proper argument ordering and values
- Error handling for invalid backends
This provides excellent coverage of the new functionality and will catch potential regressions.
|
It is surprising that the tests work with flux and mpich but they fail for flux with openmpi and slurm with mpich. The issue with flux and openmpi could be related to the pmix configuration, but then at least the slurm with mpich test should work as well. |
for more information, see https://pre-commit.ci
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #760 +/- ##
==========================================
+ Coverage 97.43% 97.47% +0.03%
==========================================
Files 32 32
Lines 1403 1423 +20
==========================================
+ Hits 1367 1387 +20
Misses 36 36 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…allel_file_executor
for more information, see https://pre-commit.ci
fdd9c5d to
2545358
Compare
|
There seems to be a general issue with the SLURM tests, so fixing those is the next step. |
* Debug slurm pmi options * Update pipeline.yml * disable MPI parallel test * sinfo * update output * add mpi parallel test * core validation is only possible on compute node not on login node * fix test * enforce pmix for testing * Update slurmspawner.py * downgrade to openmpi * remove pmi setting * use slurm environment * Try pmix again * Update slurmspawner.py * Update slurmspawner.py * Update pipeline.yml * Update pipeline.yml * Update slurmspawner.py * Update pipeline.yml * use slrum args * extend test * fix tests * Add executor_pmi_mode option * check all files * fixes * extend tests * rename to pmi_mode * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
.github/workflows/pipeline.yml (1)
283-284: LGTM! Enhanced MPI environment setup and debugging.The explicit installation of
mpichand the addition of debugging commands (sinfoandsrun --mpi=list) will help diagnose MPI configuration issues. The test reordering also appears sensible.Consider adding error handling for the debugging commands to prevent workflow failures if they encounter issues:
- sinfo -o "%n %e %m %a %c %C" - srun --mpi=list + sinfo -o "%n %e %m %a %c %C" || echo "sinfo command failed" + srun --mpi=list || echo "srun --mpi=list command failed"This would ensure the workflow continues even if these diagnostic commands fail, while still providing valuable debugging information when they succeed.
Also applies to: 300-303
notebooks/3-hpc-job.ipynb (1)
1-501: Address static analysis issues in the notebook.The static analysis tools have identified several issues that should be addressed:
- Unused import
executorlib.SlurmJobExecutoron line 2- Module-level imports not at the top of file (lines 16, 65-66, 93-94)
- Use
contextlib.suppress(OSError)instead of try-except-pass (lines 71-74)- Use single
withstatement for multiple contexts (lines 96-97)For the unused import, apply this diff:
-from executorlib import SlurmJobExecutorFor the try-except block, consider:
- try: - shutil.rmtree(cache_dir) - except OSError: - pass + with contextlib.suppress(OSError): + shutil.rmtree(cache_dir)executorlib/task_scheduler/interactive/slurmspawner.py (1)
97-97: LGTM! Consider using list append for consistency.The PMI mode parameter addition and conditional logic are correct. For consistency with the rest of the function, consider using the list append pattern:
- if pmi_mode is not None: - command_prepend_lst += ["--mpi=" + pmi_mode] + if pmi_mode is not None: + command_prepend_lst += ["--mpi", pmi_mode]This approach is more consistent with other arguments in the function and avoids string concatenation.
Also applies to: 111-111, 119-120
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
.github/workflows/pipeline.yml(2 hunks)docs/installation.md(1 hunks)executorlib/executor/flux.py(14 hunks)executorlib/executor/single.py(1 hunks)executorlib/executor/slurm.py(12 hunks)executorlib/standalone/command.py(2 hunks)executorlib/standalone/inputcheck.py(1 hunks)executorlib/task_scheduler/file/shared.py(3 hunks)executorlib/task_scheduler/file/task_scheduler.py(7 hunks)executorlib/task_scheduler/interactive/fluxspawner.py(4 hunks)executorlib/task_scheduler/interactive/slurmspawner.py(7 hunks)notebooks/3-hpc-job.ipynb(1 hunks)tests/test_fluxclusterexecutor.py(4 hunks)tests/test_fluxjobexecutor.py(2 hunks)tests/test_fluxpythonspawner.py(2 hunks)tests/test_slurmclusterexecutor.py(5 hunks)tests/test_slurmjobexecutor.py(1 hunks)tests/test_standalone_command.py(1 hunks)tests/test_standalone_inputcheck.py(2 hunks)tests/test_standalone_interactive_backend.py(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- docs/installation.md
🚧 Files skipped from review as they are similar to previous changes (2)
- executorlib/task_scheduler/file/shared.py
- tests/test_standalone_command.py
🧰 Additional context used
🧬 Code Graph Analysis (3)
executorlib/executor/flux.py (1)
executorlib/standalone/inputcheck.py (1)
check_pmi(104-116)
tests/test_standalone_inputcheck.py (1)
executorlib/standalone/inputcheck.py (1)
check_pmi_mode(149-153)
executorlib/task_scheduler/file/task_scheduler.py (1)
executorlib/standalone/inputcheck.py (1)
check_pmi_mode(149-153)
🪛 Ruff (0.12.2)
notebooks/3-hpc-job.ipynb
2-2: executorlib.SlurmJobExecutor imported but unused
Remove unused import: executorlib.SlurmJobExecutor
(F401)
16-16: Module level import not at top of file
(E402)
65-65: Module level import not at top of file
(E402)
66-66: Module level import not at top of file
(E402)
71-74: Use contextlib.suppress(OSError) instead of try-except-pass
Replace with contextlib.suppress(OSError)
(SIM105)
93-93: Module level import not at top of file
(E402)
94-94: Module level import not at top of file
(E402)
96-97: Use a single with statement with multiple contexts instead of nested with statements
(SIM117)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.13)
- GitHub Check: unittest_openmpi (macos-latest, 3.13)
- GitHub Check: unittest_mpich (ubuntu-latest, 3.12)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.11)
- GitHub Check: unittest_win
- GitHub Check: unittest_mpich (ubuntu-latest, 3.11)
- GitHub Check: unittest_mpich (macos-latest, 3.13)
- GitHub Check: unittest_openmpi (ubuntu-latest, 3.12)
- GitHub Check: notebooks_integration
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
- GitHub Check: unittest_flux_openmpi
- GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
- GitHub Check: unittest_slurm_mpich
- GitHub Check: unittest_flux_mpich
🔇 Additional comments (34)
executorlib/executor/single.py (1)
332-332: Parameter rename aligns with codebase refactoring.The change from
flux_executor_pmi_modetopmi_modeis consistent with the broader refactoring effort to standardize PMI interface parameter naming across the codebase.tests/test_fluxjobexecutor.py (1)
93-93: Test parameter updates maintain coverage with new naming convention.The parameter rename from
flux_executor_pmi_modetopmi_modein both test methods correctly maintains test coverage for MPI functionality while adopting the standardized parameter naming.Also applies to: 105-105
tests/test_fluxpythonspawner.py (1)
85-85: Dictionary key updates align with parameter standardization.The parameter key changes from
"flux_executor_pmi_mode"to"pmi_mode"in the executor_kwargs dictionaries correctly reflect the standardized parameter naming while maintaining test functionality.Also applies to: 99-99
tests/test_slurmclusterexecutor.py (2)
29-29: SBATCH directive change improves MPI job specification.The change from
--cpus-per-task={{cores}}to--ntasks={{cores}}is appropriate for MPI jobs, as--ntaskscorrectly specifies the number of MPI processes rather than CPUs per task.
53-53: PMI mode parameter addition enhances MPI compatibility.The addition of
pmi_mode="pmi2"parameter to all SlurmClusterExecutor test cases aligns with the codebase standardization effort and provides explicit PMI interface specification for better MPI compatibility.Also applies to: 67-67, 81-81, 100-100
tests/test_fluxclusterexecutor.py (1)
44-44: PMI mode parameter addition standardizes test configuration.The addition of
pmi_mode=pmiparameter to all FluxClusterExecutor test cases properly standardizes PMI interface specification across the test suite, using the environment-based configuration for runtime flexibility.Also applies to: 58-58, 86-86, 105-105
tests/test_standalone_interactive_backend.py (1)
87-87: LGTM! Test correctly updated for PMI mode support.The test properly validates that the
pmi_mode="pmi2"parameter results in the--mpi=pmi2flag being included in the generated Slurm command. This aligns well with the broader PMI mode standardization across the codebase.Also applies to: 105-105
executorlib/task_scheduler/interactive/fluxspawner.py (1)
38-38: LGTM! Clean parameter standardization.The renaming from
flux_executor_pmi_modetopmi_modeproperly standardizes the PMI mode parameter across the codebase. The implementation correctly updates the constructor parameter, internal attribute, docstring, and usage in thebootupmethod.Also applies to: 54-54, 69-69, 112-113
executorlib/standalone/inputcheck.py (1)
149-153: LGTM! Function properly generalized for unified PMI mode support.The rename from
check_flux_executor_pmi_modetocheck_pmi_modeand the updated error message correctly reflect the broader applicability to both SLURM and Flux backends. This aligns well with the PMI mode standardization effort.tests/test_standalone_inputcheck.py (1)
16-16: LGTM! Test properly updated to match function renaming.The test correctly reflects the renaming from
check_flux_executor_pmi_modetocheck_pmi_modeand uses the newpmi_modeparameter. The test logic remains intact and continues to validate that specifying PMI mode raises a ValueError on local workstations.Also applies to: 80-82
tests/test_slurmjobexecutor.py (3)
17-23: LGTM! Well-structured MPI test function.The
mpi_functfunction correctly importsmpi4pylocally and returns the expected tuple format for testing parallel execution.
30-30: Good addition of PMI mode configuration.Adding
--mpi=pmi2to theslurm_cmd_argsproperly tests the new PMI mode functionality for serial execution.
38-46: Comprehensive parallel execution test.The test correctly configures the executor for parallel execution with 2 cores and validates that the MPI function returns results from both ranks. The assertion properly checks for the expected list of tuples format.
notebooks/3-hpc-job.ipynb (1)
122-122: Parameter renaming is consistent and correct.The change from
flux_executor_pmi_modetopmi_modealigns with the broader standardization effort across the codebase.executorlib/executor/slurm.py (6)
47-47: Good documentation of PMI mode parameter.The docstring clearly explains the purpose of
pmi_modeand mentions the OpenMPI v5 requirement for pmix.
95-95: Consistent parameter addition in constructor.The
pmi_modeparameter is properly added to theSlurmClusterExecutorconstructor signature.
179-179: Parameter propagation is correct.The
pmi_modeparameter is properly passed through to thecreate_file_executorcall.
285-285: Consistent implementation across executor classes.The
SlurmJobExecutorfollows the same pattern asSlurmClusterExecutorfor thepmi_modeparameter addition.
400-400: Factory function updated consistently.The
create_slurm_executorfunction signature properly includes thepmi_modeparameter.
454-454: Parameter correctly added to resource dictionary.The
pmi_modeis properly stored in the resource dictionary for downstream use.executorlib/executor/flux.py (5)
46-46: Parameter renamed for consistency.The standardization from
flux_executor_pmi_modetopmi_modeimproves consistency across the codebase.
96-96: Constructor signature updated correctly.The
FluxJobExecutorconstructor properly uses the newpmi_modeparameter name.
178-178: Parameter propagation maintained.The
pmi_modeparameter is correctly passed through to thecreate_flux_executorcall.
473-473: check_pmi call updated correctly.The function call now uses the standardized
pmi_modeparameter name instead of the oldflux_executor_pmi_mode.
482-482: Resource dictionary assignment updated.The resource dictionary now correctly uses the
pmi_modekey for consistency.executorlib/standalone/command.py (3)
20-25: Good function signature extension.The addition of
backendandpmi_modeparameters with proper optional typing extends functionality while maintaining backward compatibility.
32-33: Clear documentation of new parameters.The docstring properly documents the new parameters, including the backend options and PMI mode usage.
40-66: Well-structured backend handling logic.The implementation correctly handles different backends:
- Default behavior preserved for
backend=None- Proper command construction for slurm with
--mpi=<pmi_mode>- Correct flux command with
-o pmi=<pmi_mode>- Appropriate error handling for invalid backends
The logic is clean and maintains the existing command structure while adding new functionality.
executorlib/task_scheduler/interactive/slurmspawner.py (3)
9-20: LGTM! Good defensive programming practice.The changes safely check for the presence of SLURM environment variables before accessing them, preventing potential KeyError exceptions when these variables are not set.
34-34: LGTM! Proper parameter addition for PMI mode support.The new
pmi_modeparameter is correctly added with appropriate type annotation, documentation, and instance variable storage.Also applies to: 48-48, 60-60
81-81: LGTM! Proper parameter propagation.The
pmi_modeparameter is correctly passed to thegenerate_slurm_commandfunction.executorlib/task_scheduler/file/task_scheduler.py (3)
10-10: LGTM! Import updated for unified PMI mode support.The import change from
check_flux_executor_pmi_modetocheck_pmi_modealigns with the broader refactoring to unify PMI mode handling across different backends.
37-37: LGTM! Proper PMI mode parameter integration.The
pmi_modeparameter is correctly added to the constructor with appropriate type annotation, documentation, and propagation to the process kwargs for task execution.Also applies to: 52-52, 73-73
89-89: Approved: PMI mode conditional logic verifiedThe addition and propagation of the
pmi_modeparameter are correct. Thecheck_pmi_modefunction (which raises ValueError whenpmi_modeis non-null) is only invoked underbackend is None, enforcing the “no PMI on a local workstation” rule as intended. No further changes required.
| cache_directory=cache_directory, | ||
| max_cores=max_cores, | ||
| resource_dict=resource_dict, | ||
| pmi_mode=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded None value in plot dependency graph path.
In the FluxClusterExecutor, when plot_dependency_graph is True, pmi_mode is hardcoded to None instead of using the constructor parameter. This may not be the intended behavior.
Apply this diff to fix the issue:
- pmi_mode=None,
+ pmi_mode=pmi_mode,📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pmi_mode=None, | |
| pmi_mode=pmi_mode, |
🤖 Prompt for AI Agents
In executorlib/executor/flux.py at line 390, the parameter pmi_mode is hardcoded
to None when plot_dependency_graph is True in FluxClusterExecutor. To fix this,
replace the hardcoded None with the pmi_mode value passed to the constructor so
that the actual intended mode is used instead of always None.
Summary by CodeRabbit